CNN Classic Papers in Practice (1): LeNet and AlexNet

2023 iThome Ironman Contest

DAY 14

AI & Data

Part 14 of the series "AI Plain-Language Movement: A! Give Me That Image!"

15th Ironman Contest

理工哈士奇嗷嗚嗷嗚

2023-09-29 20:45:41

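Introduction

Yesterday we covered some history, introducing the two founding models of the CNN world, LeNet and AlexNet. One goal was to show that CNNs have been around longer than we might think; the other was to use them to explain how mainstream CNN architectures are designed. Today we return to implementation so you can see firsthand how these two models work.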

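Prerequisites

Python (at least some familiarity with Python syntax)
Object-oriented programming (at least the concepts of class and function)
Key features of the LeNet-5 and AlexNet architectures (review: https://ithelp.ithome.com.tw/articles/10330192 )
Convolution (review: https://ithelp.ithome.com.tw/articles/10323076 )
Convolutional neural networks (review: https://ithelp.ithome.com.tw/articles/10323077 )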

After reading today's post, you might know...

How to build the LeNet-5 model
How to handle the special structure in LeNet-5
How to build the AlexNet model

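I. LeNet-5 in Practice with PyTorch

Yesterday we only briefly covered the historical background of LeNet-5. Today we go through each detail alongside the code.

1. LeNet-5 Model Architecture

As mentioned yesterday, the connection between S2 and C3 in LeNet-5 differs from today's mainstream approach. Not every convolution kernel acts on all of the input feature maps; instead, each C3 kernel is connected only to a specific subset of the S2 output channels, following a fixed connection table. The implementation is as follows: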

import torch
import torch.nn as nn

class LeNet5(nn.Module):
    def __init__(self):
        super(LeNet5, self).__init__()
        # Layer C1: Convolutional layer
        self.c1 = nn.Conv2d(in_channels=1, out_channels=6, kernel_size=5, padding=2)
        # Layer S2: Sub-sampling layer (Max-Pooling)
        self.s2 = nn.MaxPool2d(kernel_size=2, stride=2)
        # Layer C3: Convolutional layer with special connections
        self.c3_1 = nn.Conv2d(in_channels=3, out_channels=1, kernel_size=5)
        self.c3_2 = nn.Conv2d(in_channels=3, out_channels=1, kernel_size=5)
        self.c3_3 = nn.Conv2d(in_channels=3, out_channels=1, kernel_size=5)
        self.c3_4 = nn.Conv2d(in_channels=3, out_channels=1, kernel_size=5)
        self.c3_5 = nn.Conv2d(in_channels=3, out_channels=1, kernel_size=5)
        self.c3_6 = nn.Conv2d(in_channels=3, out_channels=1, kernel_size=5)
        self.c3_7 = nn.Conv2d(in_channels=4, out_channels=1, kernel_size=5)
        self.c3_8 = nn.Conv2d(in_channels=4, out_channels=1, kernel_size=5)
        self.c3_9 = nn.Conv2d(in_channels=4, out_channels=1, kernel_size=5)
        self.c3_10 = nn.Conv2d(in_channels=4, out_channels=1, kernel_size=5)
        self.c3_11 = nn.Conv2d(in_channels=4, out_channels=1, kernel_size=5)
        self.c3_12 = nn.Conv2d(in_channels=4, out_channels=1, kernel_size=5)
        self.c3_13 = nn.Conv2d(in_channels=4, out_channels=1, kernel_size=5)
        self.c3_14 = nn.Conv2d(in_channels=4, out_channels=1, kernel_size=5)
        self.c3_15 = nn.Conv2d(in_channels=4, out_channels=1, kernel_size=5)
        self.c3_16 = nn.Conv2d(in_channels=6, out_channels=1, kernel_size=5)

        # Layer S4: Sub-sampling layer (Max-Pooling)
        self.s4 = nn.MaxPool2d(kernel_size=2, stride=2)
        # Layer C5: Fully connected layer
        self.c5 = nn.Linear(5 * 5 * 16, 120)
        # Layer F6: Fully connected layer
        self.f6 = nn.Linear(120, 84)
        # Output layer
        self.output = nn.Linear(84, 10)

    def forward(self, x):
        # Layer C1: Convolutional layer
        x = torch.relu(self.c1(x))
        # Layer S2: Sub-sampling layer
        x = self.s2(x)
        # Layer C3: Convolutional layer with special connections
        x1 = torch.relu(self.c3_1(x[:,:3,:,:]))
        x2 = torch.relu(self.c3_2(x[:,1:4,:,:]))
        x3 = torch.relu(self.c3_3(x[:,2:5,:,:]))
        x4 = torch.relu(self.c3_4(x[:,3:6,:,:]))
        x5 = torch.relu(self.c3_5(torch.cat((x[:,:1,:,:], x[:,4:6,:,:]), dim=1)))
        x6 = torch.relu(self.c3_6(torch.cat((x[:,:2,:,:], x[:,5:6,:,:]), dim=1)))
        x7 = torch.relu(self.c3_7(x[:,0:4,:,:]))
        x8 = torch.relu(self.c3_8(x[:,1:5,:,:]))
        x9 = torch.relu(self.c3_9(x[:,2:6,:,:]))
        x10 = torch.relu(self.c3_10(torch.cat((x[:,:1,:,:], x[:,3:6,:,:]), dim=1)))
        x11 = torch.relu(self.c3_11(torch.cat((x[:,:2,:,:], x[:,4:6,:,:]), dim=1)))
        x12 = torch.relu(self.c3_12(torch.cat((x[:,:3,:,:], x[:,5:6,:,:]), dim=1)))
        x13 = torch.relu(self.c3_13(torch.cat((x[:,:2,:,:], x[:,3:5,:,:]), dim=1)))
        x14 = torch.relu(self.c3_14(torch.cat((x[:,1:3,:,:], x[:,4:6,:,:]), dim=1)))
        x15 = torch.relu(self.c3_15(torch.cat((x[:,:1,:,:], x[:,2:4,:,:], x[:,5:6,:,:]), dim=1)))
        x16 = torch.relu(self.c3_16(x))
        x = torch.cat((x1, x2, x3, x4, x5, x6, x7, x8, x9, x10, x11, x12, x13, x14, x15, x16), dim=1)
        # Layer S4: Sub-sampling layer
        x = self.s4(x)
        # Flatten the feature maps for fully connected layers
        x = x.view(x.size(0), -1)
        # Layer C5: Fully connected layer
        x = torch.relu(self.c5(x))
        # Layer F6: Fully connected layer
        x = torch.relu(self.f6(x))
        # Output layer
        x = self.output(x)
        return x
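
The channel slices used in forward above can be summarized as a connection table. The following is a minimal sketch in plain Python, with the channel indices copied from those slices. The names C3_CONNECTIONS and render_table are hypothetical and exist only for this illustration.

```python
# S2 channels (0-5) feeding each of the 16 C3 feature maps,
# read off the slices used in LeNet5.forward (c3_1 ... c3_16).
C3_CONNECTIONS = [
    (0, 1, 2), (1, 2, 3), (2, 3, 4), (3, 4, 5), (0, 4, 5), (0, 1, 5),
    (0, 1, 2, 3), (1, 2, 3, 4), (2, 3, 4, 5), (0, 3, 4, 5),
    (0, 1, 4, 5), (0, 1, 2, 5), (0, 1, 3, 4), (1, 2, 4, 5), (0, 2, 3, 5),
    (0, 1, 2, 3, 4, 5),
]

def render_table(connections, num_inputs=6):
    """Return the table as text: rows are S2 channels, columns are C3 maps."""
    header = "S2\\C3 " + " ".join(f"{j + 1:>2}" for j in range(len(connections)))
    rows = [header]
    for s in range(num_inputs):
        # Mark with X every C3 map that reads S2 channel s
        cells = " ".join(" X" if s in group else " ." for group in connections)
        rows.append(f"{s:>5} " + cells)
    return "\n".join(rows)

table_text = render_table(C3_CONNECTIONS)
print(table_text)
```

Each column has 3, 4, or 6 marks, matching the in_channels of the corresponding c3 layer.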

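To realize this architecture, we build 16 independent convolution kernels between the S2 output and the C3 output, each of which processes only the specific input channels given by the connection table. With this grouping, the total number of learnable parameters between S2 and C3 is: (5*5*3+1)*6+(5*5*4+1)*9+(5*5*6+1)*1=1516
Note: the parameter count of a single convolution kernel is the number of elements it contains. With C input channels and a K×K kernel, that is (K×K×C+1). Above we use 6 kernels with 3 input channels, 9 kernels with 4 input channels, and 1 kernel with 6 input channels, which is how the total comes to 1516. The constant 1 in the formula is the bias: each kernel usually has one learnable bias term that is added to its output.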
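
The arithmetic above can be checked with a few lines of plain Python. This is only a sketch of the (K×K×C+1) formula. The helper name conv_params is hypothetical, and the dense comparison (all 16 kernels seeing all 6 channels) is added for illustration rather than taken from the original article.

```python
def conv_params(kernel_size, in_channels, out_channels=1):
    """Weights plus one bias per output channel of a convolution layer."""
    return (kernel_size * kernel_size * in_channels + 1) * out_channels

# (input channels, number of kernels with that many input channels)
c3_groups = [(3, 6), (4, 9), (6, 1)]
sparse_total = sum(conv_params(5, c) * n for c, n in c3_groups)

# For comparison: a standard convolution where all 16 kernels see all 6 channels
dense_total = conv_params(5, 6, out_channels=16)

print(sparse_total)  # 1516
print(dense_total)   # 2416
```

With the LeNet5 class above, summing p.numel() over the parameters whose names start with c3_ should give the same 1516.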

2. Training Procedure

With the model architecture covered, let's apply it to the MNIST handwritten digit dataset.
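
Before training, it may help to check why a 28×28 MNIST image fits this model, and where the 5 * 5 * 16 in the c5 layer comes from. The sketch below applies the usual output-size formula, floor((size + 2*padding - kernel) / stride) + 1, to each layer of the LeNet5 class above. The helper name out_size is hypothetical. The original paper used 32×32 inputs, which is presumably why this implementation adds padding=2 to c1.

```python
def out_size(size, kernel, stride=1, padding=0):
    """Spatial output size of a convolution or pooling layer (floor division)."""
    return (size + 2 * padding - kernel) // stride + 1

size = 28                                  # MNIST input: 1 x 28 x 28
c1 = out_size(size, kernel=5, padding=2)   # 6 x 28 x 28
s2 = out_size(c1, kernel=2, stride=2)      # 6 x 14 x 14
c3 = out_size(s2, kernel=5)                # 16 x 10 x 10
s4 = out_size(c3, kernel=2, stride=2)      # 16 x 5 x 5
flat = 16 * s4 * s4                        # number of input features of c5

print(c1, s2, c3, s4, flat)  # 28 14 10 5 400
```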

import torch
import torch.nn as nn
import torch.optim as optim
import torchvision
import torchvision.transforms as transforms

batch_size = 64
learning_rate = 0.001
num_epochs = 10

transform = transforms.Compose([transforms.ToTensor(), transforms.Normalize((0.5,), (0.5,))])

train_dataset = torchvision.datasets.MNIST(root='./data', train=True, transform=transform, download=True)
train_loader = torch.utils.data.DataLoader(dataset=train_dataset, batch_size=batch_size, shuffle=True)

test_dataset = torchvision.datasets.MNIST(root='./data', train=False, transform=transform, download=True)
test_loader = torch.utils.data.DataLoader(dataset=test_dataset, batch_size=batch_size, shuffle=False)

model = LeNet5()

criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=learning_rate)

total_step = len(train_loader)
for epoch in range(num_epochs):
    for i, (images, labels) in enumerate(train_loader):
        # Forward pass
        outputs = model(images)
        loss = criterion(outputs, labels)

        # Backward pass and optimize
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        if (i + 1) % 100 == 0:
            print(f'Epoch [{epoch+1}/{num_epochs}], Step [{i+1}/{total_step}], Loss: {loss.item():.4f}')

print('Training finished.')

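3. Evaluating Model Performance

Once the model is trained, we can use the test set to judge how well training went (test data differs from training data: the model has never seen it before).
Two things matter when testing a model: evaluation mode and disabling gradient computation. The former is model.eval(), which switches layers that behave differently during training (Dropout, for example) to their inference behavior. The latter is with torch.no_grad():, which tells PyTorch that we are only testing, so there is no need to compute gradients or update parameters. A side benefit is that skipping these steps makes the whole process a bit faster.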

model.eval()  # Set model to evaluation mode

correct = 0
total = 0

with torch.no_grad():
    for images, labels in test_loader:
        outputs = model(images)
        _, predicted = torch.max(outputs.data, 1)
        total += labels.size(0)
        correct += (predicted == labels).sum().item()

print(f'Test Accuracy: {100 * correct / total}%')
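
The accuracy bookkeeping in the loop above can be illustrated without PyTorch. The sketch below uses made-up logits for one small batch (hypothetical values, not real model output) and mirrors what torch.max(..., 1) and the comparison with labels do: take the index of the largest score in each row, then count matches.

```python
# Hypothetical logits for a batch of 3 samples and 4 classes
outputs = [
    [0.1, 2.0, -1.0, 0.3],
    [1.5, 0.2, 0.1, 0.0],
    [0.0, 0.1, 0.2, 3.0],
]
labels = [1, 2, 3]

# Index of the largest logit per row, i.e. the predicted class
predicted = [row.index(max(row)) for row in outputs]

correct = sum(int(p == y) for p, y in zip(predicted, labels))
total = len(labels)
accuracy = 100 * correct / total

print(predicted)           # [1, 0, 3]
print(f"{accuracy:.2f}%")  # 66.67%
```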

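4. Complete Code

Note that the activation function, subsampling layers, and output layer used in this implementation are not the ones used in the original paper. Feel free to swap them out yourself; the results will differ. Today's post focuses only on the parts of LeNet-5 that most often cause confusion: the connection between S2 and C3, and how to count the parameters.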


import torch
import torch.nn as nn
import torch.optim as optim
import torchvision
import torchvision.transforms as transforms

class LeNet5(nn.Module):
    def __init__(self):
        super(LeNet5, self).__init__()
        # Layer C1: Convolutional layer
        self.c1 = nn.Conv2d(in_channels=1, out_channels=6, kernel_size=5, padding=2)
        # Layer S2: Sub-sampling layer (Max-Pooling)
        self.s2 = nn.MaxPool2d(kernel_size=2, stride=2)
        # Layer C3: Convolutional layer with special connections
        self.c3_1 = nn.Conv2d(in_channels=3, out_channels=1, kernel_size=5)
        self.c3_2 = nn.Conv2d(in_channels=3, out_channels=1, kernel_size=5)
        self.c3_3 = nn.Conv2d(in_channels=3, out_channels=1, kernel_size=5)
        self.c3_4 = nn.Conv2d(in_channels=3, out_channels=1, kernel_size=5)
        self.c3_5 = nn.Conv2d(in_channels=3, out_channels=1, kernel_size=5)
        self.c3_6 = nn.Conv2d(in_channels=3, out_channels=1, kernel_size=5)
        self.c3_7 = nn.Conv2d(in_channels=4, out_channels=1, kernel_size=5)
        self.c3_8 = nn.Conv2d(in_channels=4, out_channels=1, kernel_size=5)
        self.c3_9 = nn.Conv2d(in_channels=4, out_channels=1, kernel_size=5)
        self.c3_10 = nn.Conv2d(in_channels=4, out_channels=1, kernel_size=5)
        self.c3_11 = nn.Conv2d(in_channels=4, out_channels=1, kernel_size=5)
        self.c3_12 = nn.Conv2d(in_channels=4, out_channels=1, kernel_size=5)
        self.c3_13 = nn.Conv2d(in_channels=4, out_channels=1, kernel_size=5)
        self.c3_14 = nn.Conv2d(in_channels=4, out_channels=1, kernel_size=5)
        self.c3_15 = nn.Conv2d(in_channels=4, out_channels=1, kernel_size=5)
        self.c3_16 = nn.Conv2d(in_channels=6, out_channels=1, kernel_size=5)

        # Layer S4: Sub-sampling layer (Max-Pooling)
        self.s4 = nn.MaxPool2d(kernel_size=2, stride=2)
        # Layer C5: Fully connected layer
        self.c5 = nn.Linear(5 * 5 * 16, 120)
        # Layer F6: Fully connected layer
        self.f6 = nn.Linear(120, 84)
        # Output layer
        self.output = nn.Linear(84, 10)

    def forward(self, x):
        # Layer C1: Convolutional layer
        x = torch.relu(self.c1(x))
        # Layer S2: Sub-sampling layer
        x = self.s2(x)
        # Layer C3: Convolutional layer with special connections
        x1 = torch.relu(self.c3_1(x[:,:3,:,:]))
        x2 = torch.relu(self.c3_2(x[:,1:4,:,:]))
        x3 = torch.relu(self.c3_3(x[:,2:5,:,:]))
        x4 = torch.relu(self.c3_4(x[:,3:6,:,:]))
        x5 = torch.relu(self.c3_5(torch.cat((x[:,:1,:,:], x[:,4:6,:,:]), dim=1)))
        x6 = torch.relu(self.c3_6(torch.cat((x[:,:2,:,:], x[:,5:6,:,:]), dim=1)))
        x7 = torch.relu(self.c3_7(x[:,0:4,:,:]))
        x8 = torch.relu(self.c3_8(x[:,1:5,:,:]))
        x9 = torch.relu(self.c3_9(x[:,2:6,:,:]))
        x10 = torch.relu(self.c3_10(torch.cat((x[:,:1,:,:], x[:,3:6,:,:]), dim=1)))
        x11 = torch.relu(self.c3_11(torch.cat((x[:,:2,:,:], x[:,4:6,:,:]), dim=1)))
        x12 = torch.relu(self.c3_12(torch.cat((x[:,:3,:,:], x[:,5:6,:,:]), dim=1)))
        x13 = torch.relu(self.c3_13(torch.cat((x[:,:2,:,:], x[:,3:5,:,:]), dim=1)))
        x14 = torch.relu(self.c3_14(torch.cat((x[:,1:3,:,:], x[:,4:6,:,:]), dim=1)))
        x15 = torch.relu(self.c3_15(torch.cat((x[:,:1,:,:], x[:,2:4,:,:], x[:,5:6,:,:]), dim=1)))
        x16 = torch.relu(self.c3_16(x))
        x = torch.cat((x1, x2, x3, x4, x5, x6, x7, x8, x9, x10, x11, x12, x13, x14, x15, x16), dim=1)
        # Layer S4: Sub-sampling layer
        x = self.s4(x)
        # Flatten the feature maps for fully connected layers
        x = x.view(x.size(0), -1)
        # Layer C5: Fully connected layer
        x = torch.relu(self.c5(x))
        # Layer F6: Fully connected layer
        x = torch.relu(self.f6(x))
        # Output layer
        x = self.output(x)
        return x


batch_size = 64
learning_rate = 0.001
num_epochs = 10

transform = transforms.Compose([transforms.ToTensor(), transforms.Normalize((0.5,), (0.5,))])

train_dataset = torchvision.datasets.MNIST(root='./data', train=True, transform=transform, download=True)
train_loader = torch.utils.data.DataLoader(dataset=train_dataset, batch_size=batch_size, shuffle=True)

test_dataset = torchvision.datasets.MNIST(root='./data', train=False, transform=transform, download=True)
test_loader = torch.utils.data.DataLoader(dataset=test_dataset, batch_size=batch_size, shuffle=False)

model = LeNet5()

criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=learning_rate)

total_step = len(train_loader)
for epoch in range(num_epochs):
    for i, (images, labels) in enumerate(train_loader):
        # Forward pass
        outputs = model(images)
        loss = criterion(outputs, labels)

        # Backward pass and optimize
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        if (i + 1) % 100 == 0:
            print(f'Epoch [{epoch+1}/{num_epochs}], Step [{i+1}/{total_step}], Loss: {loss.item():.4f}')

print('Training finished.')

model.eval()  # Set model to evaluation mode

correct = 0
total = 0

with torch.no_grad():
    for images, labels in test_loader:
        outputs = model(images)
        _, predicted = torch.max(outputs.data, 1)
        total += labels.size(0)
        correct += (predicted == labels).sum().item()

print(f'Test Accuracy: {100 * correct / total}%')

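II. AlexNet in Practice with PyTorch

As with LeNet-5, yesterday we covered the historical background; today we use code to look at how each part of AlexNet is designed.

1. AlexNet Model Architecture

AlexNet has 8 learnable layers in total (5 convolutional and 3 fully connected). Including the subsampling layers, the order is: convolution, subsampling, convolution, subsampling, convolution, convolution, convolution, subsampling, fully connected, fully connected, fully connected. This is essentially the same as today's mainstream CNN architectures. Compared with the special structure in LeNet-5, nothing here needs special attention, so the implementation is more straightforward.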

# Define the AlexNet model
class AlexNet(nn.Module):
    def __init__(self, num_classes=10):
        super(AlexNet, self).__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=11, stride=4, padding=2),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2),
            nn.Conv2d(64, 192, kernel_size=5, padding=2),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2),
            nn.Conv2d(192, 384, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(384, 256, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(256, 256, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2),
        )
        self.avgpool = nn.AdaptiveAvgPool2d((6, 6))
        self.classifier = nn.Sequential(
            nn.Dropout(),
            nn.Linear(256 * 6 * 6, 4096),
            nn.ReLU(inplace=True),
            nn.Dropout(),
            nn.Linear(4096, 4096),
            nn.ReLU(inplace=True),
            nn.Linear(4096, num_classes),
        )

    def forward(self, x):
        x = self.features(x)
        x = self.avgpool(x)
        x = x.view(x.size(0), 256 * 6 * 6)
        x = self.classifier(x)
        return x
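
To see where the 256 * 6 * 6 in the classifier comes from, the feature-map size can be traced layer by layer. This is a minimal sketch in plain Python, assuming a 224×224 input and the kernel, stride, and padding values from the AlexNet class above. The helper out_size and the layers list are hypothetical names for illustration.

```python
def out_size(size, kernel, stride=1, padding=0):
    """Spatial output size of a convolution or pooling layer (floor division)."""
    return (size + 2 * padding - kernel) // stride + 1

# (name, kernel, stride, padding, out_channels); None means channels unchanged (pooling)
layers = [
    ("conv1", 11, 4, 2, 64),
    ("pool1", 3, 2, 0, None),
    ("conv2", 5, 1, 2, 192),
    ("pool2", 3, 2, 0, None),
    ("conv3", 3, 1, 1, 384),
    ("conv4", 3, 1, 1, 256),
    ("conv5", 3, 1, 1, 256),
    ("pool3", 3, 2, 0, None),
]

size, channels = 224, 3
shapes = []
for name, kernel, stride, padding, out_channels in layers:
    size = out_size(size, kernel, stride, padding)
    channels = out_channels or channels
    shapes.append((name, channels, size))
    print(f"{name}: {channels} x {size} x {size}")

flat = channels * size * size
print(flat)  # 9216
```

Starting from size = 32 instead, the feature map shrinks to 1×1 before the last pooling layer, which is smaller than its 3×3 window. That is likely why the complete code resizes the inputs to 224×224.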

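2. Complete Code

The original AlexNet was trained on the large ImageNet dataset, which takes a long time, so we train on a relatively small dataset instead: CIFAR10. Like MNIST, CIFAR10 is a 10-class classification task. The difference is that MNIST consists of 28×28 grayscale images of handwritten digits, whereas CIFAR10 consists of 32×32 color images of objects such as cats, dogs, cars, and airplanes.
Since training and evaluation work the same way as above, we won't discuss them again. Here is the complete code for reference: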

import torch
import torch.nn as nn
import torch.optim as optim
import torchvision
import torchvision.transforms as transforms

# Define the AlexNet model
class AlexNet(nn.Module):
    def __init__(self, num_classes=10):
        super(AlexNet, self).__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=11, stride=4, padding=2),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2),
            nn.Conv2d(64, 192, kernel_size=5, padding=2),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2),
            nn.Conv2d(192, 384, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(384, 256, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(256, 256, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2),
        )
        self.avgpool = nn.AdaptiveAvgPool2d((6, 6))
        self.classifier = nn.Sequential(
            nn.Dropout(),
            nn.Linear(256 * 6 * 6, 4096),
            nn.ReLU(inplace=True),
            nn.Dropout(),
            nn.Linear(4096, 4096),
            nn.ReLU(inplace=True),
            nn.Linear(4096, num_classes),
        )

    def forward(self, x):
        x = self.features(x)
        x = self.avgpool(x)
        x = x.view(x.size(0), 256 * 6 * 6)
        x = self.classifier(x)
        return x

# Hyperparameters
batch_size = 64
learning_rate = 0.001
num_epochs = 10

# Data preprocessing and loading
transform = transforms.Compose([transforms.Resize((224,224)),
                                transforms.ToTensor(),
                                transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))])

train_dataset = torchvision.datasets.CIFAR10(root='./data', train=True, transform=transform, download=True)
train_loader = torch.utils.data.DataLoader(dataset=train_dataset, batch_size=batch_size, shuffle=True)

test_dataset = torchvision.datasets.CIFAR10(root='./data', train=False, transform=transform)
test_loader = torch.utils.data.DataLoader(dataset=test_dataset, batch_size=batch_size, shuffle=False)

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

# Initialize the AlexNet model
model = AlexNet(num_classes=10).to(device)

# Loss and optimizer
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=learning_rate)

# Training loop
total_step = len(train_loader)
for epoch in range(num_epochs):
    model.train()
    for i, (images, labels) in enumerate(train_loader):
        outputs = model(images.to(device))
        loss = criterion(outputs, labels.to(device))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        if (i + 1) % 100 == 0:
            print(f'Epoch [{epoch+1}/{num_epochs}], Step [{i+1}/{total_step}], Loss: {loss.item():.4f}')

# Evaluation
model.eval()
with torch.no_grad():
    correct = 0
    total = 0
    for images, labels in test_loader:
        # Move the batch to the same device as the model
        images, labels = images.to(device), labels.to(device)
        outputs = model(images)
        _, predicted = torch.max(outputs, 1)
        total += labels.size(0)
        correct += (predicted == labels).sum().item()

print(f'Test Accuracy: {100 * correct / total}%')

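One thing differs from the LeNet-5 code: this time we use device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu"), so that training runs on the GPU when a supported one is available. This is also the biggest difference between AlexNet and LeNet-5: training with GPU acceleration.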
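
One way to get a feel for why GPU acceleration matters here is to count the learnable parameters. The sketch below does this in plain Python for the AlexNet class above with num_classes=10. The helper names conv_params and linear_params are hypothetical, and the layer sizes are copied from the class definition.

```python
def conv_params(in_ch, out_ch, kernel):
    """Weights plus one bias per output channel of a convolution layer."""
    return in_ch * out_ch * kernel * kernel + out_ch

def linear_params(in_features, out_features):
    """Weights plus one bias per output unit of a fully connected layer."""
    return in_features * out_features + out_features

features = (
    conv_params(3, 64, 11)
    + conv_params(64, 192, 5)
    + conv_params(192, 384, 3)
    + conv_params(384, 256, 3)
    + conv_params(256, 256, 3)
)
classifier = (
    linear_params(256 * 6 * 6, 4096)
    + linear_params(4096, 4096)
    + linear_params(4096, 10)
)

print(features)               # 2469696
print(classifier)             # 54575114
print(features + classifier)  # 57044810
```

That is roughly 57 million parameters, compared with about 61 thousand for the LeNet-5 implementation above. Most of them sit in the fully connected layers.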